Compressed multiple pattern matching
Peer reviewed
Lempel-Ziv Parsing for Sequences of Blocks
The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in case of bytes) are related as z_b = O(bz log(n/z)). The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound z_b = O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods. © 2021 by the authors. Licensee MDPI, Basel, Switzerland. Funding: This research was funded by the Ministry of Science and Higher Education of the Russian Federation (Ural Mathematical Center project No. 075-02-2021-1387).
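To make the two quantities in the abstract above concrete, the following is a minimal sketch that counts LZ77 phrases for the same data read as bits (z) and as blocks of b bits (z_b). It assumes the greedy "overlapping" variant in which a phrase is either a letter with no earlier occurrence or the longest prefix of the remaining suffix that also starts at an earlier position; the paper may use a slightly different phrase definition. The function names `lz77_phrase_count`, `to_bits`, and `to_blocks` are hypothetical, and the naive scan is for illustration only, not an efficient parser.

```python
def lz77_phrase_count(seq):
    """Count phrases of a greedy LZ77 parsing (overlapping variant, assumed).

    Each phrase is either a single letter with no earlier occurrence or the
    longest prefix of the remaining suffix that also starts at an earlier
    position. Naive scan over all earlier positions, for illustration only.
    """
    n = len(seq)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            k = 0
            # j < i, so j + k < i + k < n and the index stays in range
            while i + k < n and seq[j + k] == seq[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # a fresh letter forms a phrase of length 1
        phrases += 1
    return phrases


def to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def to_blocks(bits, b):
    """Group a bit list into letters that are tuples of b bits."""
    return [tuple(bits[p:p + b]) for p in range(0, len(bits), b)]


data = b"abracadabra"
bits = to_bits(data)
z = lz77_phrase_count(bits)                  # parsing over the bit alphabet
z_b = lz77_phrase_count(to_blocks(bits, 8))  # parsing over 8-bit blocks
```

Because grouping into 8-bit blocks maps each byte to a distinct letter, `z_b` here coincides with the phrase count of the byte sequence itself.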
Detecting One-variable Patterns
Given a pattern such that
, where is a
variable and its reversal, and
are strings that contain no variables, we describe an
algorithm that constructs in time a compact representation of all
instances of in an input string of length over a polynomially bounded
integer alphabet, so that one can report those instances in time. Comment: 16 pages (+13 pages of Appendix), 4 figures, accepted to SPIRE 201
Palindromic Length of Words with Many Periodic Palindromes
The palindromic length of a finite word is the minimal
number of palindromes whose concatenation is equal to . In 2013, Frid,
Puzynina, and Zamboni conjectured that: If is an infinite word and is
an integer such that for every factor of then
is ultimately periodic.
Suppose that is an infinite word and is an integer such
for every factor of . Let be the set
of all factors of that have more than
palindromic prefixes. We show that is an infinite set and we show
that for each positive integer there are palindromes and a word such that is a factor of and is nonempty. Note
that is a periodic word and is a palindrome for each . These results justify the following question: What is the palindromic
length of a concatenation of a suffix of and a periodic word with
"many" periodic palindromes?
It is known that ,
where and are nonempty words. The main result of our article shows that
if are palindromes, is nonempty, is a nonempty suffix of ,
is the minimal period of , and is a positive integer
with then
Lempel–Ziv-Like Parsing in Small Space
Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression nH_k + o(n log σ) with k = o(log_σ n), where n is the input length and σ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be Ω(log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ. © 2020, Springer Science+Business Media, LLC, part of Springer Nature. D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454). This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS).
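The first stage described in the abstract above, parsing the text relative to a fixed reference, can be sketched as follows. This is a minimal illustration under stated assumptions: a greedy parse where each phrase is the longest prefix of the remaining text occurring anywhere in the reference, or a single literal letter when no such prefix exists. The names `rlz_parse` and `rlz_decode` and the phrase encoding are hypothetical, and the quadratic scan stands in for the index a real implementation would build over the reference. The second stage (feeding the phrases to LZ as metasymbols) is not shown.

```python
def rlz_parse(text, reference):
    """Greedy parse of `text` relative to `reference` (illustrative only).

    Returns a list of phrases: (position, length) pairs pointing into the
    reference, or ("lit", letter) when the next letter is absent from it.
    """
    phrases = []
    n, m = len(text), len(reference)
    i = 0
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(m):  # try every starting position in the reference
            k = 0
            while i + k < n and j + k < m and reference[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            phrases.append(("lit", text[i]))
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(phrases, reference):
    """Rebuild the text from the phrase list produced by rlz_parse."""
    out = []
    for first, second in phrases:
        if first == "lit":
            out.append(second)
        else:
            out.append(reference[first:first + second])
    return "".join(out)


phrases = rlz_parse("abcabcx", "abc")
# The phrase list plays the role of the metasymbol sequence that a
# second-level LZ parsing would then process.
```

In this toy input the seven-letter text becomes three metasymbols, which illustrates why the second-level sequence is usually much shorter than the original.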
Palindromic Decompositions with Gaps and Errors
Identifying palindromes in sequences has been an interesting line of research
in combinatorics on words and also in computational biology, after the
discovery of the relation of palindromes in the DNA sequence with the HIV
virus. Efficient algorithms for the factorization of sequences into palindromes
and maximal palindromes have been devised in recent years. We extend these
studies by allowing gaps in decompositions and errors in palindromes, and also
imposing a lower bound to the length of acceptable palindromes.
We first present an algorithm for obtaining a palindromic decomposition of a
string of length n with the minimal total gap length in time O(n log n * g) and
space O(n g), where g is the number of allowed gaps in the decomposition. We
then consider a decomposition of the string in maximal \delta-palindromes (i.e.
palindromes with \delta errors under the edit or Hamming distance) and g
allowed gaps. We present an algorithm to obtain such a decomposition with the
minimal total gap length in time O(n (g + \delta)) and space O(n g). Comment: accepted to CSR 201
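The first problem in the abstract above (a decomposition into palindromes and at most g gaps with minimal total gap length, subject to a lower bound on palindrome length) can be stated as a small dynamic program. The sketch below is a naive formulation for illustration and is not the O(n log n * g) algorithm of the paper; the function name `min_total_gap` and its parameters are hypothetical.

```python
def min_total_gap(s, max_gaps, min_pal_len=1):
    """Minimal total gap length of a decomposition of s into palindromes of
    length >= min_pal_len and at most max_gaps gaps; None if impossible.

    Naive dynamic program over prefixes, for illustration only.
    """
    inf = float("inf")
    n = len(s)
    # best[i][j]: minimal total gap length for the prefix s[:i] using j gaps
    best = [[inf] * (max_gaps + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(1, n + 1):
        for p in range(i):
            piece = s[p:i]
            is_pal = len(piece) >= min_pal_len and piece == piece[::-1]
            for j in range(max_gaps + 1):
                if is_pal and best[p][j] < best[i][j]:
                    best[i][j] = best[p][j]  # piece is an acceptable palindrome
                if j > 0 and best[p][j - 1] + (i - p) < best[i][j]:
                    best[i][j] = best[p][j - 1] + (i - p)  # piece is a gap
    answer = min(best[n])
    return None if answer == inf else answer
```

For example, with `min_pal_len=3` and one allowed gap, the string "abcbaxyz" decomposes into the palindrome "abcba" and the gap "xyz".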
Pal^k is linear recognizable online
Given a language L that is online recognizable in linear time and space, we construct a linear time and space online recognition algorithm for the language L · Pal, where Pal is the language of all nonempty palindromes. Hence for every fixed positive k, Pal^k is online recognizable in linear time and space. Thus we solve an open problem posed by Galil and Seiferas in 1978. © Springer-Verlag Berlin Heidelberg 2015
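For readers unfamiliar with the language Pal^k, the following brute-force membership test may help: it checks whether a string is a concatenation of exactly k nonempty palindromes. It is a naive offline check and is not the linear-time online algorithm of the paper; the name `in_pal_k` is hypothetical.

```python
def is_palindrome(s):
    """True iff s is a nonempty palindrome."""
    return len(s) > 0 and s == s[::-1]


def in_pal_k(s, k):
    """True iff s is a concatenation of exactly k nonempty palindromes.

    Tracks the set of prefix lengths reachable with a given number of
    palindromes. Naive, for illustration only.
    """
    n = len(s)
    reachable = {0}  # prefix lengths expressible with the palindromes used so far
    for _ in range(k):
        extended = set()
        for start in reachable:
            for end in range(start + 1, n + 1):
                if is_palindrome(s[start:end]):
                    extended.add(end)
        reachable = extended
    return n in reachable
```

For instance, "abacc" is in Pal^2 as "aba" followed by "cc", but it is not in Pal^1.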
Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries
Longest common extension queries (LCE queries) and runs are ubiquitous in
algorithmic stringology. Linear-time algorithms computing runs and
preprocessing for constant-time LCE queries have been known for over a decade.
However, these algorithms assume a linearly-sortable integer alphabet. A recent
breakthrough paper by Bannai et al. (SODA 2015) showed a link between the
two notions: all the runs in a string can be computed via a linear number of
LCE queries. The first to consider these problems over a general ordered
alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an
-time algorithm for answering LCE queries. This
result was improved by Gawrychowski et al. (accepted to CPM 2016) to time. In this work we note a special non-crossing property
of LCE queries asked in the runs computation. We show that any such
non-crossing queries can be answered on-line in time, which
yields an -time algorithm for computing runs
Run compressed rank/select for large alphabets
Given a string of length n that is composed of r runs of letters from the alphabet {0, 1, ..., σ-1} such that 2 ≤ σ ≤ r, we describe a data structure that, provided r ≤ n/log^{ω(1)} n, stores the string in r log(nσ/r) + o(r log(nσ/r)) bits and supports select and access queries in O(log(log(n/r)/log log n)) time and rank queries in O(log(log(nσ/r)/log log n)) time. We show that r log(n(σ-1)/r) - O(log(n/r)) bits are necessary for any such data structure and, thus, our solution is succinct. We also describe a data structure that uses (1 + ϵ) r log(nσ/r) + O(r) bits, where ϵ > 0 is an arbitrary constant, with the same query times but without the restriction r ≤ n/log^{ω(1)} n. By simple reductions to the colored predecessor problem, we show that the query times are optimal in the important case r ≥ 2^{log^δ n}, for an arbitrary constant δ > 0. We implement our solution and compare it with the state of the art, showing that the closest competitors consume 31-46% more space. © 2018 IEEE. Peer reviewed
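The three queries named in the abstract above can be illustrated on a plain run-length representation. The sketch below stores run boundaries in ordinary Python lists and answers access, rank, and select by binary search, so each query takes O(log r) time. It is not succinct and does not achieve the query times of the paper; the class name `RunLengthString` and its layout are hypothetical. Here `rank(c, i)` counts occurrences of c among the first i positions and `select(c, j)` returns the 0-based position of the j-th occurrence (j starting at 1), which are assumed conventions.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


class RunLengthString:
    """Run-length representation with access/rank/select (illustrative only)."""

    def __init__(self, s):
        self.n = len(s)
        self.run_starts = []   # start position of every run
        self.run_letters = []  # letter of every run
        self.per_letter = {}   # letter -> (starts, lengths, counts_before)
        pos = 0
        for letter, group in groupby(s):
            length = sum(1 for _ in group)
            self.run_starts.append(pos)
            self.run_letters.append(letter)
            starts, lengths, before = self.per_letter.setdefault(letter, ([], [], []))
            # occurrences of `letter` in all of its earlier runs
            before.append(before[-1] + lengths[-1] if before else 0)
            starts.append(pos)
            lengths.append(length)
            pos += length

    def access(self, i):
        """Letter at position i (0-based)."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.run_letters[bisect_right(self.run_starts, i) - 1]

    def rank(self, letter, i):
        """Number of occurrences of `letter` in the first i positions."""
        if letter not in self.per_letter:
            return 0
        starts, lengths, before = self.per_letter[letter]
        t = bisect_left(starts, i)  # runs of `letter` that start before i
        if t == 0:
            return 0
        return before[t - 1] + min(lengths[t - 1], i - starts[t - 1])

    def select(self, letter, j):
        """Position of the j-th occurrence of `letter` (j >= 1)."""
        if j < 1 or letter not in self.per_letter:
            raise IndexError(j)
        starts, lengths, before = self.per_letter[letter]
        t = bisect_left(before, j) - 1  # last run with fewer than j earlier occurrences
        offset = j - before[t] - 1
        if offset >= lengths[t]:
            raise IndexError(j)
        return starts[t] + offset
```

A typical use is `rls = RunLengthString("aaabbaac")` followed by calls such as `rls.access(4)`, `rls.rank("a", 6)`, and `rls.select("a", 4)`.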
Cold intense electron beams from LN2-cooled GaAs-photocathodes
To study electron-ion interactions at the Heidelberg heavy-ion storage ring, electron beams with low energy spreads and dc currents of milliamperes are desired. Measurements of the photoelectron energy distribution showed that electron beams with energy spreads of 5-8 meV can be obtained from GaAs photocathodes cooled to about LN2 temperature. However, in order to reach milliampere beam currents, the laser illumination has to be increased up to 1 W, causing substantial cathode heating. The presented new electron gun design, based on sapphire-substrate transmission-mode photocathodes cooled by LN2, stabilizes the GaAs bulk temperature under 1 W laser illumination at about 95 K and thereby provides the prerequisites for an electron gun operated at milliampere currents with low energy spreads.